# Multimodal Generation

**Blip Arabic Flickr 8k** · MIT · omarsabri8756 · 56 downloads · 1 like
An Arabic image-captioning model fine-tuned on the BLIP architecture and optimized for the Flickr8k Arabic dataset.
*Image-to-Text · Transformers · Supports Multiple Languages*

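A minimal usage sketch for this kind of BLIP captioner with `transformers`; the repo id below is an assumption based on the author and model name, so verify it on the model page.

```python
# Caption an image with a BLIP checkpoint (repo id assumed; verify on the Hub).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "omarsabri8756/blip-Arabic-flickr-8k"  # assumed repo id
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))  # Arabic caption
```
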
**GLM 4 32B 0414 GGUF** · MIT · unsloth · 4,680 downloads · 10 likes
GLM-4-32B-0414 is a 32-billion-parameter large language model comparable in performance to GPT-4o and DeepSeek-V3. It supports both Chinese and English and excels at code generation, function calling, and complex task processing.
*Large Language Model · Supports Multiple Languages*

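GGUF builds like this one target llama.cpp-family runtimes rather than `transformers`. A minimal sketch with `llama-cpp-python`, assuming the repo id matches the listing and picking one quantization file by glob:

```python
# Run a GGUF quant with llama-cpp-python; repo id and quant level are assumptions.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/GLM-4-32B-0414-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                # glob picks one quant; choose per your RAM
    n_ctx=8192,
)
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python quicksort."}]
)
print(resp["choices"][0]["message"]["content"])
```
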
**Instancecap Captioner** · Other · AnonMegumi · 14 downloads · 1 like
A vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct on the InstanceVid dataset, specializing in instance-level image caption generation.
*Image-to-Text · Transformers*

**GLM 4 32B 0414** · MIT · THUDM · 10.91k downloads · 320 likes
GLM-4-32B-0414 is a 32-billion-parameter large language model comparable in performance to the GPT series. It supports both Chinese and English and excels at code generation, function calling, and complex task processing.
*Large Language Model · Transformers · Supports Multiple Languages*

**Llama 3.2 Vision Instruct Bpmncoder** · Apache-2.0 · utkarshkingh · 40 downloads · 1 like
A Llama 3.2 11B Vision instruction-tuned model optimized with Unsloth, using 4-bit quantization for roughly 2x faster training.
*Text-to-Image · Transformers · English*

**Vit Gpt2 Image Captioning** · Apache-2.0 · aryan083 · 31 downloads · 0 likes
An image-captioning model based on the ViT and GPT2 architectures, capable of generating natural-language descriptions for input images.
*Image-to-Text*

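ViT+GPT2 captioners follow the standard `VisionEncoderDecoderModel` pattern in `transformers`. A minimal sketch, shown with the well-known reference checkpoint since this entry's exact repo id isn't listed:

```python
# Standard ViT encoder + GPT2 decoder captioning loop.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_id = "nlpconnect/vit-gpt2-image-captioning"  # reference checkpoint; swap in this entry's repo
model = VisionEncoderDecoderModel.from_pretrained(model_id)
processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pixel_values = processor(
    images=Image.open("photo.jpg").convert("RGB"), return_tensors="pt"
).pixel_values
ids = model.generate(pixel_values, max_new_tokens=30)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```
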
**Cockatiel 13B** · Fr0zencr4nE · 26 downloads · 2 likes
A video-to-text generation model built on VILA-v1.5-13B, capable of producing fine-grained, human-preference-aligned descriptive text for input videos.
*Video-to-Text · Transformers*

**Llama 3.2 11B Vision Invoices Mini** · Apache-2.0 · atulSethi · 46 downloads · 1 like
A multimodal large language model fine-tuned from unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit, supporting visual instruction understanding tasks; Unsloth optimization roughly doubles training speed.
*Text-to-Image · Transformers · English*

**Qwenfluxprompt** · Apache-2.0 · mam33 · 25 downloads · 0 likes
A LoRA trained for the Wan2.1 14B video generation model, suitable for text-to-video and image-to-video tasks.
*Video Processing · Supports Multiple Languages*

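A sketch of how such a LoRA would be attached to the diffusers Wan2.1 text-to-video pipeline; the pipeline class is real, but the LoRA repo id and its compatibility with this loader are assumptions to verify against the model card:

```python
# Attach a Wan2.1 LoRA to the diffusers pipeline (LoRA repo id assumed).
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("mam33/qwenfluxprompt")  # assumed repo id

frames = pipe(prompt="a red panda surfing a small wave", num_frames=33).frames[0]
export_to_video(frames, "out.mp4", fps=16)
```
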
**Liquid V1 7B** · MIT · Junfeng5 · 11.35k downloads · 84 likes
Liquid is an autoregressive generation paradigm that fuses visual understanding and generation by tokenizing images into discrete codes and learning those code embeddings alongside text tokens in a shared feature space.
*Text-to-Image · Transformers · English*

**Molmo 7B D 0924 NF4** · Apache-2.0 · Scoolar · 1,259 downloads · 1 like
A 4-bit quantized version of Molmo-7B-D-0924 that reduces VRAM usage via the NF4 quantization strategy, suited to VRAM-constrained environments.
*Image-to-Text · Transformers*

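This repo ships already-quantized weights; the NF4 strategy it names corresponds to the standard bitsandbytes 4-bit config, sketched here as the equivalent load applied to the base Molmo checkpoint:

```python
# NF4 4-bit loading with bitsandbytes, applied to the base Molmo checkpoint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 storage
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924",
    quantization_config=bnb,
    trust_remote_code=True,  # Molmo uses custom modeling code
    device_map="auto",
)
```
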
**Mini Image Captioning** · Apache-2.0 · cnmoro · 292 downloads · 3 likes
A lightweight image-captioning model based on bert-mini and vit-small, weighing only about 130 MB and extremely fast on CPU.
*Image-to-Text · Transformers · English*

**Janus Pro 1B ONNX** · MIT · onnx-community · 3,010 downloads · 47 likes
Janus-Pro-1B is a multimodal causal language model supporting tasks such as text-to-image and image-to-text.
*Text-to-Image · Transformers*

**Longva 7B TPO** · MIT · ruili0 · 225 downloads · 1 like
LongVA-7B-TPO is a video-text model derived from LongVA-7B through temporal preference optimization, excelling at long-video understanding tasks.
*Video-to-Text · Transformers*

**Hunyuanvideo HFIE** · Other · jbilcke-hf · 21 downloads · 1 like
Tencent's Hunyuan Video is a text-to-video generation model, packaged here for compatibility with Hugging Face Inference Endpoints.
*Text-to-Video · English*

**Instructcir Llava Phi35 Clip224 Lp** · Apache-2.0 · uta-smile · 15 downloads · 2 likes
InstructCIR is an instruction-aware, contrastive-learning-based compositional image retrieval model using the ViT-L-224 and Phi-3.5-Mini architectures, focusing on image-text-to-text generation tasks.
*Image-to-Text*

**Captain Eris Violet V0.420 12B** · Other · Nitral-AI · 445.12k downloads · 41 likes
A 12B-parameter merged model created with the mergekit tool by combining Epiculous/Violet_Twilight-v0.2 and Nitral-AI/Captain_BMO-12B, supporting text generation tasks.
*Large Language Model · Transformers · English*

**Cogvideox 2B LiFT** · MIT · Fudan-FUXI · 21 downloads · 1 like
CogVideoX-2B-LiFT is a text-to-video generation model fine-tuned from CogVideoX-1.5 using reward-weighted learning.
*Text-to-Video · English*

**Llama 3.2 11B Vision Radiology Mini** · Apache-2.0 · mervinpraison · 39 downloads · 2 likes
A vision instruction-tuned model optimized with Unsloth, supporting multimodal task processing.
*Text-to-Image · Transformers · English*

**Thaicapgen Clip Gpt2** · Natthaphon · 18 downloads · 0 likes
An encoder-decoder model pairing a CLIP encoder with a GPT2 decoder to generate Thai image descriptions.
*Image-to-Text · Other*

**Janus 1.3B ONNX** · Other · onnx-community · 123 downloads · 15 likes
Janus-1.3B is a multimodal causal language model supporting text-to-image, image-to-text, and image-text-to-text tasks.
*Text-to-Image · Transformers*

**Omnigen V1** · MIT · Shitao · 5,886 downloads · 309 likes
OmniGen is a unified image generation model that supports multiple image generation tasks.
*Image Generation*

**Emu3 Stage1** · Apache-2.0 · BAAI · 1,359 downloads · 26 likes
Emu3 is a multimodal model developed by the Beijing Academy of Artificial Intelligence, trained solely with next-token prediction and supporting image, text, and video processing.
*Text-to-Image · Transformers*

**Sd15.ip Adapter.plus** · Apache-2.0 · refiners · 112 downloads · 0 likes
An IP-Adapter-based image-to-image adapter for the Stable Diffusion 1.5 model, supporting artistic image generation driven by image prompts.
*Image Generation · Other*

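This entry packages the adapter for the Refiners framework; in diffusers the same IP-Adapter Plus technique looks like the sketch below, using the reference h94/IP-Adapter weights rather than this repo:

```python
# IP-Adapter Plus on SD 1.5 via diffusers (reference weights, not the refiners repo).
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-plus_sd15.bin"
)
pipe.set_ip_adapter_scale(0.7)  # how strongly the image prompt steers generation

style = load_image("style_ref.png")
image = pipe(prompt="a cat, painterly style", ip_adapter_image=style).images[0]
image.save("cat.png")
```
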
**Aim Xlarge** · MIT · hp-l33 · 23 downloads · 5 likes
AiM is an unconditional image generation model built on PyTorch, integrated with and pushed to the Hugging Face Hub via PytorchModelHubMixin.
*Image Generation*

**Cogflorence 2.2 Large** · MIT · thwri · 20.64k downloads · 33 likes
A fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with captions generated by THUDM/cogvlm2-llama3-chat-19B, suited to image-to-text tasks.
*Image-to-Text · Transformers · Supports Multiple Languages*

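Florence-2 fine-tunes keep the base model's task-prompt interface. A minimal captioning sketch, assuming the repo id matches the listing:

```python
# Florence-2-style captioning with a task prompt (repo id assumed).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "thwri/CogFlorence-2.2-Large"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

task = "<MORE_DETAILED_CAPTION>"
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(text=task, images=image, return_tensors="pt")
ids = model.generate(
    input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
    max_new_tokens=256, num_beams=3,
)
raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=image.size))
```
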
**Lumina Mgpt 7B 1024** · Alpha-VLLM · 27 downloads · 9 likes
Lumina-mGPT is a family of multimodal autoregressive models that excel at generating flexible, realistic images from text descriptions and can perform various vision and language tasks.
*Text-to-Image*

**Lumina Mgpt 7B 768** · Alpha-VLLM · 1,944 downloads · 33 likes
Lumina-mGPT is a family of multimodal autoregressive models that excel at generating flexible, realistic images from text descriptions and can perform various vision and language tasks.
*Text-to-Image · Transformers*

**Lumina Mgpt 7B 768 Omni** · Alpha-VLLM · 264 downloads · 7 likes
Lumina-mGPT is a series of multimodal autoregressive models that excel at generating flexible, realistic images from text descriptions.
*Text-to-Image · Transformers*

**Cogflorence 2.1 Large** · MIT · thwri · 2,541 downloads · 22 likes
A fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with captions generated by THUDM/cogvlm2-llama3-chat-19B, focusing on image-to-text tasks.
*Image-to-Text · Transformers · Supports Multiple Languages*

**Latte 1** · Apache-2.0 · maxin-cn · 1,027 downloads · 19 likes
Latte is a Transformer-based latent diffusion model focused on text-to-video generation, with pre-trained weights available for multiple datasets.
*Text-to-Video*

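Latte has a native pipeline in recent diffusers releases. A minimal sketch:

```python
# Text-to-video with diffusers' LattePipeline.
import torch
from diffusers import LattePipeline
from diffusers.utils import export_to_gif

pipe = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1", torch_dtype=torch.float16
).to("cuda")
frames = pipe("a dog wearing sunglasses on a beach").frames[0]
export_to_gif(frames, "latte.gif")
```
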
**Shotluck Holmes 1.5** · Apache-2.0 · RichardLuo · 158 downloads · 3 likes
Shot2Story-20K is an image-to-text generation model capable of converting input images into coherent textual descriptions or stories.
*Image-to-Text · Transformers · English*

**Vit Base Patch16 224 Turkish Gpt2** · Apache-2.0 · atasoglu · 20 downloads · 2 likes
A vision encoder-decoder model based on ViT and a Turkish GPT2, used to generate Turkish image descriptions.
*Image-to-Text · Transformers · Other*

**VLM WebSight Finetuned** · Apache-2.0 · HuggingFaceM4 · 611 downloads · 184 likes
Converts screenshots of website components into HTML/CSS code; developed from an early checkpoint of a vision-language foundation model.
*Image-to-Text · Transformers · Supports Multiple Languages*

**I2vgen Xl** · MIT · ali-vilab · 4,252 downloads · 172 likes
An open-source video synthesis codebase developed by Alibaba's Tongyi Lab, integrating multiple advanced video generation models.
*Text-to-Video*

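The I2VGen-XL image-to-video model from this codebase is available through diffusers. A minimal sketch following the documented pipeline:

```python
# Image-to-video with diffusers' I2VGenXLPipeline.
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import load_image, export_to_video

pipe = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
image = load_image("first_frame.jpg")
frames = pipe(
    prompt="a sailboat drifting at sunset", image=image, num_inference_steps=50
).frames[0]
export_to_video(frames, "i2v.mp4", fps=8)
```
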
**Sharecaptioner** · Lin-Chen · 401 downloads · 56 likes
ShareCaptioner is an open-source image caption generation model based on an improved InternLM-Xcomposer-7B base model and fine-tuned on the GPT4-Vision-assisted ShareGPT4V dataset; it generates high-quality image descriptions.
*Image-to-Text · Transformers*

**Textdiffuser2 Layout Planner** · MIT · JingyeChen22 · 337 downloads · 5 likes
TextDiffuser-2 is a text-to-image generation model focused on text rendering, leveraging the capabilities of language models to generate images containing text.
*Text-to-Image*

**Text To Image** · sairajg · 848 downloads · 16 likes
Generates high-quality images from input text descriptions, suitable for scenarios such as creative design and content creation.
*Text-to-Image*

**Image Caption Using ViT GPT2** · Apache-2.0 · Ayansk11 · 15 downloads · 1 like
An image-captioning model based on the Vision Transformer (ViT) and GPT2 architectures, capable of generating natural-language descriptions for input images.
*Image-to-Text · Transformers*

**Biomedgpt LM 7B** · Apache-2.0 · PharMolix · 485 downloads · 72 likes
BioMedGPT-LM-7B is the first large-scale generative language model for the biomedical domain based on Llama2, specializing in biomedical text generation and question answering.
*Large Language Model · Transformers*

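As a Llama2-based LM, it loads with the stock causal-LM API; the Q/A prompt format below is an assumption, so check the model card:

```python
# Standard causal-LM generation; the Question/Answer prompt format is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PharMolix/BioMedGPT-LM-7B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Question: What is the mechanism of action of metformin?\nAnswer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```
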